A quick intro to the intro to R Lesson Series


This ‘Intro to R Lesson Series’ is brought to you by the Centre for the Analysis of Genome Evolution & Function’s (CAGEF) bioinformatics training initiative. This course was developed based on feedback on the needs and interests of the Department of Cell & Systems Biology and the Department of Ecology and Evolutionary Biology.

This lesson is the third in a 6-part series. The idea is that at the end of the series, you will be able to import and manipulate your data, make exploratory plots, perform some basic statistical tests, test a regression model, and make some even prettier plots and documents to share your results.


How do we get there? Today we are going to be learning how to make all sorts of plots - from simple data exploration to interactive plots.The next lesson will be data cleaning and string manipulation; this is really the battleground of coding - getting your data into the format where you can analyse it. Then we will learn how to do t-tests and perfrom regression and modeling in R. And lastly, we will learn to write some functions, which really can save you time and help scale up your analyses.


The structure of the class is a code-along style. It is hands on. The lecture AND code we are going through are available on GitHub for download at https://github.com/eacton/CAGEF (Note: repo is private until approved), so you can spend the time coding and not taking notes. As we go along, there will be some challenge questions and multiple choice questions on Socrative. At the end of the class if you could please fill out a post-lesson survey (https://www.surveymonkey.com/r/VNQZ3KS), it will help me further develop this course and would be greatly appreciated.



Objective: At the end of this session you will be able to use ggplot2() to make a ton of different types of plots with your data for both for data exploration and for publication-quality figures.
***

Intro to the Grammar of Graphics

Grammar of graphics: a language to be able to communicate about what we are plotting programatically

(Figure borrowed from Paul Hiemstra)

(Figure borrowed from Paul Hiemstra)

This grammar allows very specific control over your plot and the ability to change features (easily). Plots can be scaled and made pretty enough for publication quality images.

ggplot2 was made to interact well with tidy (long) datasets. I have found that if someone is spending lots of time figuring out how to make a scatterplot, his or her data is probably not in the correct format for ggplot2.

The Grammar of Graphics in Action: Making Figures with ggplot2

Our Dataset

Metagenomic 16SrRNA sequencing of latrines from Tanzania and Vietnam at different depths (multiples of 20cm).

We have 2 csv files (change one to tsv or xlsx - maybe both and make an additional files and get a google spreadsheet): 1. A metadata file (Naming conventions: [Country_LatrineNo_Depth]) with sample names and environmental variables. 2. A table of species abundance.

B Torondel, JHJ Ensink, O Gunvirusdu, UZ Ijaz, J Parkhill, F Abdelahi, V-A Nguyen, S Sudgen, W Gibson, AW Walker, and C Quince. Assessment of the influence of intrinsic environmental and geographical factors on the bacterial ecology of pit latrines Microbial Biotechnology, 9(2):209-223, 2016. DOI:10.1111/1751-7915.12334


Scatterplot Example

I have saved a version of the OTU table in tidy format (which we created in Lesson 2 - split_dat). As well as a modified version of the metadata table.

Let’s read them in and also load ggplot2.

library(tidyverse)

dat <- read.csv("data/long_SPE_pitlatrine.csv", stringsAsFactors = F, header = TRUE)

Let’s build a plot by adding components one by one to see how the grammar of graphics is implemented in ggplot2. To start, we of course need to input our data. However, if that is all we input, what we get back is a blank graphic. We have not yet said what we want to plot and how we want to plot it.

library(tidyverse)
── Attaching packages ─────────────────────────────────────────────────────────────────────────────────────────── tidyverse 1.2.1 ──
✔ ggplot2 2.2.1     ✔ purrr   0.2.4
✔ tibble  1.4.2     ✔ dplyr   0.7.4
✔ tidyr   0.8.0     ✔ stringr 1.3.0
✔ readr   1.1.1     ✔ forcats 0.3.0
── Conflicts ────────────────────────────────────────────────────────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
dat <- read.csv("data/long_SPE_pitlatrine.csv", stringsAsFactors = F, header = TRUE)

We have, however, created a ggplot object. This is a list of 9 parameters: data, layers, scales, mapping, theme, coordinates, facet, plot environment, and labels. Luckily there are some defaults, so we don’t have to specify everything, but you can start to see how ggplot objects are highly customizable.

ggplot(ndat)

The next step is to choose the data we are plotting (aesthetics). At this point the data can be scaled and the axes appear. We have not yet specified how we want the data plot, just which data should be plotted. In practice, people usually omit ‘mapping =’, but it is a good reminder that mapping is, in fact, what we are doing.

str(ggplot(ndat))
List of 9
 $ data       :'data.frame':    81 obs. of  14 variables:
  ..$ Country       : Factor w/ 2 levels "Tanzania","Vietnam": 1 1 1 1 1 1 1 1 1 1 ...
  ..$ Latrine_Number: int [1:81] 2 2 2 2 2 2 2 2 3 3 ...
  ..$ Depth         : int [1:81] 1 10 12 2 3 6 7 9 2 3 ...
  ..$ pH            : num [1:81] 7.82 9.08 8.84 6.49 6.46 7.69 7.48 7.6 7.55 7.68 ...
  ..$ Temp          : num [1:81] 25.1 24.2 25.1 29.6 27.9 28.7 29.8 25 28.8 28.9 ...
  ..$ TS            : num [1:81] 14.5 37.8 71.1 13.9 29.4 ...
  ..$ VS            : num [1:81] 71.33 31.52 5.94 64.93 26.85 ...
  ..$ VFA           : num [1:81] 71 2 1 3.7 27.5 1.5 1.1 1.1 30.9 24.2 ...
  ..$ CODt          : int [1:81] 874 102 35 389 161 57 107 62 384 372 ...
  ..$ CODs          : int [1:81] 311 9 4 180 35 3 9 8 57 57 ...
  ..$ perCODsbyt    : int [1:81] 36 9 10 46 22 6 8 13 15 15 ...
  ..$ NH4           : num [1:81] 3.3 1.2 0.5 6.2 2.4 0.8 0.7 0.9 21.6 20.4 ...
  ..$ Prot          : num [1:81] 35.4 18.4 0 29.3 19.4 0 14.1 7.6 33.1 44.5 ...
  ..$ Carbo         : num [1:81] 22 43 17 25 31 14 28 28 47 48 ...
 $ layers     : list()
 $ scales     :Classes 'ScalesList', 'ggproto' <ggproto object: Class ScalesList>
    add: function
    clone: function
    find: function
    get_scales: function
    has_scale: function
    input: function
    n: function
    non_position_scales: function
    scales: NULL
    super:  <ggproto object: Class ScalesList> 
 $ mapping    : list()
 $ theme      : list()
 $ coordinates:Classes 'CoordCartesian', 'Coord', 'ggproto' <ggproto object: Class CoordCartesian, Coord>
    aspect: function
    distance: function
    expand: TRUE
    is_linear: function
    labels: function
    limits: list
    range: function
    render_axis_h: function
    render_axis_v: function
    render_bg: function
    render_fg: function
    train: function
    transform: function
    super:  <ggproto object: Class CoordCartesian, Coord> 
 $ facet      :Classes 'FacetNull', 'Facet', 'ggproto' <ggproto object: Class FacetNull, Facet>
    compute_layout: function
    draw_back: function
    draw_front: function
    draw_labels: function
    draw_panels: function
    finish_data: function
    init_scales: function
    map: function
    map_data: function
    params: list
    render_back: function
    render_front: function
    render_panels: function
    setup_data: function
    setup_params: function
    shrink: TRUE
    train: function
    train_positions: function
    train_scales: function
    vars: function
    super:  <ggproto object: Class FacetNull, Facet> 
 $ plot_env   :<environment: R_GlobalEnv> 
 $ labels     : list()
 - attr(*, "class")= chr [1:2] "gg" "ggplot"

We now have chosen the geometric object (geom) with which to plot our data, in this case a point. A geom could be a line, a bar, a boxplot - you can type geom_ and then Tab to see all of the available options. Autocomplete can also be helpful for remembering syntax.

ggplot(ndat, mapping = aes(x = TS, y = CODt))

The data looks like there are two groupings. My guess would be that this is for the 2 different countries. We can easily test this by colouring our points by Country. A legend will be automatically created for you. I have also changed the shape and size of the geom. A quick reference key for shapes can be found in the ‘Cookbook for R’ (http://www.cookbook-r.com/Graphs/Shapes_and_line_types/).

ggplot(ndat, aes(x=TS, y=CODt)) + geom_point() 

Some of our data points seem to be crushed near the x-axis. We can scale the y-axis to fix this. When we start customizing our plot, it easier to read our code if we have each specification on a new line.

ggplot(ndat, aes(x=TS, y=CODt, colour = Country)) + geom_point(shape = 18, size = 5) 

Faceting

Faceting allows us to split our data into groups. Note that I have removed the colour. It is good data visualization practice to only have one attribute (colour, shading, faceting, symbols) per grouping.

ggplot(ndat, aes(x=TS, y=CODt, colour = Country)) + 
  geom_point(alpha = 0.3) + 
  scale_y_log10()

I could now add information from another variable as a colour in this plot. Note that if a variable is continuous instead of discrete, the colour will be a gradient.

ggplot(ndat, aes(x=TS, y=CODt)) + 
  geom_point() +
  scale_y_log10() +
  facet_grid(~Country)

If we really wanted to we could add another variable to the plot by changing the shape attribute. Note that shape can only be used for discrete values. Let’s change Country to having 2 shapes and facet by the Depth of the latrine instead.

ggplot(ndat, aes(x=TS, y=CODt, colour = Temp)) + 
  geom_point() +
  scale_y_log10() +
  facet_grid(~Country)

We can now note that only Tanzania had sampling greater than 4cm of depth. There are single latrines for 4 samples. There was no latrine at a depth of 11cm. Lack of replication and a bias towards Tanzania for the higher depths is something we should keep in mind while looking at this data. Depending on the question we are trying to answer, we may want to remove some of this data. We can also note that our numbers are being ordered as if they were characters. If you recall we created Depth by splitting a sample name. We can change Depth to numeric type, and the numbers will be ordered properly.

ggplot(ndat, aes(x=TS, y=CODt, colour = Temp, shape = Country)) + 
  geom_point() +
  scale_y_log10() +
  facet_wrap(~Depth, scales = "free_y")

One thing that is not necessary in this case - but good to know about - is the ability to allow each grid to have its own independent axis scale. In our example, wells of Depths from 1-4cm have up to 1000 CODt, while the other wells barely have values past 100 CODt. This can be changed, but keep in mind most people will assume all grids have the same scale, so take extra care to point that the scales are different when presenting or publishing.

ndat$Depth <- as.numeric(ndat$Depth)
ggplot(ndat, aes(x=TS, y=CODt, colour = Temp, shape = Country)) + 
  geom_point() +
  scale_y_log10() +
  facet_wrap(~Depth)

Adding Regression Lines

You can also add statistical transformations to your plots. Again, take a look at stat_ then Tab to see the list of options. In this case let’s separately fit a linear regression line to CODt vs TS for each country. The grey area around the line is the confidence interval (default=0.95) and can be removed with the additional call to stat_smooth of ‘se = FALSE’.

ggplot(ndat, aes(x=TS, y=CODt, colour = Temp, shape = Country)) + 
  geom_point() +
  scale_y_log10() +
  facet_wrap(~Depth, scales = "free_y")

A linear model is not necessarily the best fit. The method of calculating the smoothing function can be changed to other provided functions or can be a custom formula. Note that I changed the confidence interval by modifying ‘level=0.8’. geoms can be made more transparent with the alpha parameter, which is set to 0.3 in the following code so that the emphasis is on the regression line rather than the points.

ggplot(ndat, aes(x=TS, y=CODt)) + 
  geom_point() +
  scale_y_log10() +
  facet_grid(~Country) +
  stat_smooth(method = lm)

Exploring different types of plots

Density Plots

We are making a density plot for OTUs by Country. I have set alpha (transparency) to 0.3 so that we can see both countries on our plot.

ggplot(ndat, aes(x=TS, y=CODt)) + 
  geom_point(alpha = 0.3) +
  scale_y_log10() +
  facet_grid(~Country) +
  stat_smooth(method = loess, level = 0.8)

The first thing to notice is that everything is clumped at 0. This is because we have not filtered our data frame to remove all observations where OTUs are zero. Here we filter to have at least 2 OTUs. The other thing to notice is that there is a long tail where there will only be a few observations. It will be necessary to change the x-axis to see our data. Note that R gives us a warning that we are not viewing 158 of our 4212 rows. We can add a rug geom to see each value.

ggplot(dat, aes(x=OTUs, fill=Country, alpha=0.3)) + 
    geom_density() 

Histograms

Histograms instead count the number of observations you have in each ‘bin’ that you specify. The default binwidth is 30. THIS HAS NOTHING TO DO WITH YOUR DATA!! CHANGE IT!! R will even warn you to change your binwidth.

Instead of having the countries information stacked, we may want to see the data side by side. This can be done with the parameter ‘position’ set to “dodge”. A rug geom can also be added to a histogram.

ggplot(dat[dat$OTU >=2,], aes(x=OTUs, fill=Country, alpha=0.3)) + 
    geom_histogram() +
  xlim(0,1000)


Challenge

Using the data filtered for OTUs greater than or equal to two, make a violin plot of the distribution of OTUs for each country. Use a log scale. Colour the violin plots by Country. Draw lines across the violin plot where the quantiles (25th, 50th, 75th) are for each plot. What do the widths of the plots represent?





Bar plots

Let’s create a bar plot of OTUs per country and use the filled in colour to represent Taxa.

ggplot(dat, aes(x=Country, y=OTUs, fill=Taxa)) + 
    geom_bar() 

This is a common error because the default for geom_bar is to use the y-axis for a count. To use it for a variable instead, we have to specify stat=“identity”.

ggplot(dat, aes(x=Country, y=OTUs, fill=Taxa)) + 
    geom_bar() 
Error: stat_count() must not be used with a y aesthetic.

Controlling the order of your categorical variables in your legend

In the legend for the diagram above our factor levels for Taxa are in alphabetical order. However, an ordering that might be more useful would be the order that matches our data, for example, Taxa in descending order of OTUs. To do this, we can use the forcats package. fct_reorder2() takes the factor we want to order by (Taxa) and orders it by the values given (Country, OTUs). We will talk about customizing plots later in this lesson, but the last line of code is one way to remove the legend title (by making the legend title ‘blank’).

Now our highest abundance Taxa is the first value in our legend, and this matches the order of the Taxa in our bar graph. The legend order now has meaning. The white line in the Vietnam bar graph is a Taxa for which there was an OTU value in Tanzania but no data in Vietnam (since anything <=2 was filtered out of the data set).

Let’s take a second to look at what happens when stat does not equal ‘identity’.

What is being counted instead of OTUs? Why does the colouring for Taxa look so different?

You can alternate between stacked or “dodged” for whether your bars are on top of each other or next to each other (split by a factor or categorical variable) or on top.

ggplot(dat, aes(x=Country, fill=Taxa)) + 
    geom_bar() 

You can have your bars horizonally instead of verstically by using coord_flip()l.

ggplot(ndat, aes(x=Depth, y=pH, fill = Country)) + 
    geom_bar(stat="identity", position = "dodge") 

Circular Plots

In ggplot, circular plots are related to bar graphs - they just have different coordinate systems. The default coordinate system is cartesian coordinates, and we need to switch to polar coordinates to make a circle. This is a Coxcomb plot - pH levels increase as you move up the outer rings, and depth increases as you move clockwise around the circle. Colour is still represented by country.

ggplot(ndat, aes(x=Depth, y=pH, fill = Country)) + 
    geom_bar(stat="identity", position = "dodge") +
  coord_flip()

To make your classic venn diagram, use theta to specify what variable is going to be used to make up the angles (width of pie slices). To wrap to a full circle instead of having sections, the width is set to one.

ggplot(ndat, aes(x=Depth, y=pH, fill = Country)) + 
    geom_bar(stat="identity", position = "dodge") +
  coord_polar()

Lines

To draw a line graph, we select geom_line().

ggplot(dat, aes(x="", y=Country, fill = Country)) + 
    geom_bar(stat = "identity", width = 1) +
  coord_polar(theta = "y")

We can colour the lines for the different countries. Looking at our previous graph, what must have happened when these values were over the same range?

ggplot(ndat, aes(x=Temp, y=pH)) + geom_line() 

Error bars

To plot error bars, we need to give the geom, geom_errorbar(), the interval over which we want to draw the bar. This is usually something like one standard deviation or one standard error or a confidence interval. Luckily, we have learned how to calculate the mean and standard deviation with dplyr. Then, all we need to do is add or subtract the standard deviation from our data point to get the bounds of the error bar. geom_errorbar() takes these bounds passed to the parameters ymax and ymin. In this case, since there were no sample replicates, the standard deviation is taken from the mean of all of the soil readings in a given country at a given depth.

ggplot(ndat, aes(x=Temp, y=pH, colour = Country)) + geom_line() 

Stripplots

The first plots we made we scatterplots with 2 continuous variables. With one discrete variable and one categorical variable, we can make a stripplot.

errdat <- ndat %>% group_by(Country, Depth) %>% mutate(mean_pH = mean(pH)) %>% mutate(sd_pH= sd(pH))
ggplot(errdat, aes(x=Temp, y=pH, colour = Country)) + 
  geom_line() + 
  geom_errorbar(aes(ymin=pH-sd_pH, ymax = pH+sd_pH), width = .2, alpha = 0.4)

Again, we have a lot of values crushed near the x-axis. It we add log scaling to the y-axis, we get an error that this transformation created infinite values. This is because we have zeros in our data set. Knowing this, we could ignore the warning, or add +1 to each OTU.

ggplot(dat, aes(x = Depth, y = OTUs)) + geom_point() 

ggplot(dat, aes(x = Depth, y = OTUs)) + geom_point() + scale_y_log10() 

In order to see the points a little better, we can ‘jitter’ them.

ggplot(dat, aes(x = Depth, y = (OTUs+1))) + geom_point() + scale_y_log10() 

Bubble Plots + Labels

Let’s group our OTUs by Country and Taxa and look at the Taxa for the top 30 OTUs.

With bubble plots, we need to provide the size of our bubbles in a way that is proportional to our data, and provide a scale to map these to.

We are going to make our points with geom_jitter so that our bubbles don’t overlap. Remember that we want to specify what we are plotting with aesthetics. We want the bubbles representing OTUs_per_Taxa, so we divide each value by pi when calling size since bubbles are circles.

ggplot(dat, aes(x = Depth, y = (OTUs+1))) + geom_point(position = "jitter") + scale_y_log10() + facet_wrap(~Country)

However, it is hard to tell proportionally how different these sizes are. Mapping values to a scale gives us more of an idea of their relative sizes.

bdat <- dat %>% select(-Latrine_Number, -Depth) %>% group_by(Taxa, Country) %>% mutate(OTUs_per_Taxa = sum(OTUs)) %>% select(-OTUs)
bdat <- unique(bdat) %>% arrange(desc(OTUs_per_Taxa)) %>% .[1:30,]
ggplot(bdat, aes(x = Country, y = OTUs_per_Taxa, fill=Taxa)) +
  scale_y_log10() +
  geom_jitter(aes(size = OTUs_per_Taxa/pi), pch = 21, show.legend = TRUE)  +
  guides(fill = FALSE)

ggplot2 provides 2 methods of labelling. Let’s look at the help menu to find out what the difference between the 2 of them is. We have to specify in with aesthetics, that we are plotting our label.

ggplot(bdat, aes(x = Country, y = OTUs_per_Taxa, fill=Taxa)) +
  scale_y_log10() +
  geom_jitter(aes(size = OTUs_per_Taxa/pi), pch = 21, show.legend = FALSE) + 
  scale_size_continuous(range=c(1,30)) +
  guides(fill=FALSE)

This isn’t fantastic because our labels are overlapping. What parameters might you be able to use to move the labels? Try using them to get the labels to not overlap.

geom_text has an option to check for overlap of labels. geom_text does not provide a background for the label, and it may be harder to tell which label belongs to each data point.

ggplot(bdat, aes(x = Country, y = OTUs_per_Taxa, fill=Taxa)) +
  scale_y_log10() +
  geom_jitter(aes(size = OTUs_per_Taxa/pi), pch = 21, show.legend = FALSE) + 
  scale_size_continuous(range=c(1,30)) +
  geom_label(aes(label = Taxa), show.legend = FALSE)

What happened here? Why don’t we have as many labels? Look in the help menu to explain this behaviour.

I don’t like futzing with label positions, so I went looking for a package that would do this for me. ggrepel() will ‘repel’ your labels away from each other without getting rid of them. Let’s check it out with our bubble plot labels. Install and load ggrepel. The equivalent function we can use is geom_label_repel(). The force parameter allow you to modify how far you want your labels pushed away from each other. Here, colour is being used to map to our data points but ggrepel can ditch the box and connect a line to data points instead. Variations on use can be found here: https://cran.r-project.org/web/packages/ggrepel/vignettes/ggrepel.html.

ggplot(bdat, aes(x = Country, y = OTUs_per_Taxa, fill=Taxa)) +
  scale_y_log10() +
  geom_jitter(aes(size = sqrt(OTUs_per_Taxa/pi)), pch = 21, show.legend = FALSE) + 
  scale_size_continuous(range=c(1,30)) +
  geom_text(aes(label = Taxa), check_overlap = TRUE, show.legend = FALSE)

Boxplots

Boxplots are a great way to visualize summary statistics for your data. As a reminder, the thick line in the center of the box is the median. The upper and lower ends of your box are the first and third quartiles or 25th and 75th percentiles of your data. The whiskers extend to the largest value no further than 1.5*IQR (inter-quartile range - the distance between the first and third quartiles). Data beyond these whiskers are considered outliers and plotted as individual points. This is a quick way to see how comparable your samples or variables are.

We are going to see boxplots for the distribution of OTUs per Taxa across all samples.

library(ggrepel)
ggplot(bdat, aes(x = Country, y = OTUs_per_Taxa, fill=Taxa)) +
  scale_y_log10() +
  geom_jitter(aes(size = sqrt(OTUs_per_Taxa/pi)), pch = 21, show.legend = FALSE) + 
  scale_size_continuous(range=c(1,30)) +
  geom_label_repel(aes(label = Taxa), force = 2,show.legend = FALSE, segment.size=0)

While it is clear that Clostridia is the most represented taxa, it is difficult to tell whether some other taxa have no representation, or if they are lowly represented. Transforming to a log scale on the y-axis will sort this out for us.

ggplot(dat, aes(x = Taxa, y = OTUs)) + geom_boxplot() + theme(axis.text.x = element_text(angle=90, hjust=1))

We could also clean this plot up a bit by removing those taxa with 1 or fewer OTUs. And have the condition that it must occur in more than 1 sample.

ggplot(dat, aes(x = Taxa, y = OTUs)) + geom_boxplot() + theme(axis.text.x = element_text(angle=90, hjust=1)) + scale_y_log10()

If we facet by country we can see that Acidobacteria_Gp1, for example, was only found in Vietnam and not Tanzania.

keep <- dat %>% ungroup() %>% filter(OTUs>=2) %>% group_by(Taxa) %>% summarize(n = n()) %>% filter(n > 1) %>% select(Taxa)
ggplot(dat[dat$Taxa %in% keep$Taxa,] , aes(x = Taxa, y = OTUs)) + geom_boxplot() + theme(axis.text.x = element_text(angle=90, hjust=1)) + scale_y_log10()

For now, let’s just keep the Taxa that are common to both countries.

ggplot(dat[dat$Taxa %in% keep$Taxa,] , aes(x = Taxa, y = OTUs)) + 
  geom_boxplot() + 
  theme(axis.text.x = element_text(angle=90, hjust=1)) + 
  scale_y_log10() + facet_grid(~Country)

Beeswarm Plots

Even though boxplots give us summary statistics on our data, it is useful to be able to see where our data points are. Adding the data using geom_point() adds our data in the form of a stripplot to our boxplots. It would be nice to see how many points we have at each OTU value, but there are a lot of points overlapping here.

keep <- dat %>% 
  ungroup() %>% 
  filter(OTUs>=2) %>% 
  group_by(Country, Taxa) %>% 
  summarize(n = n()) %>% 
  filter(n > 1) 
keep$dup <- keep %>% ungroup() %>% select(Taxa) %>% duplicated(.)
keep <- keep[keep$dup, "Taxa"]
ggplot(dat[dat$Taxa %in% keep$Taxa,] , aes(x = Taxa, y = OTUs)) + geom_boxplot() + theme(axis.text.x = element_text(angle=90, hjust=1)) + scale_y_log10() + facet_grid(~Country)

Jittering the points gives a general idea, but is a bit too widely dispursed to give a good sense of the data.

ggplot(dat[(dat$Taxa=="Clostridia" | dat$Taxa == "Gammaproteobacteria" | dat$Taxa == "Unknown" | dat$Taxa == "Bacilli") & dat$Country == "T",] , aes(x = Taxa, y = OTUs)) + 
  geom_boxplot() + 
  geom_point() +
  theme(axis.text.x = element_text(angle=90, hjust=1)) + 
  scale_y_log10()

A beeswarm plot places data points that were overlapping instead next to each other, so we can get a better picture of the distribution of points. We simply overlay the points with geom_beeswarm().

ggplot(dat[(dat$Taxa=="Clostridia" | dat$Taxa == "Gammaproteobacteria" | dat$Taxa == "Unknown" | dat$Taxa == "Bacilli") & dat$Country == "T",] , aes(x = Taxa, y = OTUs)) + 
  geom_boxplot() + 
  geom_jitter() +
  theme(axis.text.x = element_text(angle=90, hjust=1)) + 
  scale_y_log10()

Increasing the spacing between data points can make this distribution a bit clearer.

library(ggbeeswarm)
ggplot(dat[(dat$Taxa=="Clostridia" | dat$Taxa == "Gammaproteobacteria" | dat$Taxa == "Unknown" | dat$Taxa == "Bacilli") & dat$Country == "T",] , aes(x = Taxa, y = OTUs)) + 
  geom_boxplot() + 
  geom_beeswarm() +
  theme(axis.text.x = element_text(angle=90, hjust=1)) + 
  scale_y_log10()

Using geom_quasirandom() gives the empirical distribution of the stripplot to avoid overplotting.

ggplot(dat[(dat$Taxa=="Clostridia" | dat$Taxa == "Gammaproteobacteria" | dat$Taxa == "Unknown" | dat$Taxa == "Bacilli") & dat$Country == "T",] , aes(x = Taxa, y = OTUs)) + 
  geom_boxplot() + 
  geom_beeswarm(cex = 2.2) +
  theme(axis.text.x = element_text(angle=90, hjust=1)) + 
  scale_y_log10()

Other spacing and distribution options are available at https://github.com/eclarke/ggbeeswarm.

A Note on Colour Palettes

There are 3 main types of colour palettes. 1. Sequential - implies an order to your data ie. light to dark implies low values to high values. 2. Diverging - low and high values are extremes, and the middle values are important still goes from light to dark, but 3 colour mainly used. 3. Qualitative - there is no quantitative relationship between colours. This is usually used for categorical data.

Which of these is ggplot2’s default color palette?

Many colour palettes now exist. I showcase a couple that work nicely with ggplot2. These packages also have colorblind friendly options. RColorBrew has options for these 3 types of palettes, which you can see with display.brewer.all(). With a smaller dataset, we could make a call to scale_fill_brewer(), which just requires a choice of one of RColorBrewer’s palettes, such as “Spectral”. However, we have 24 Taxa and these palettes have 8-12 colours, so we have to get creative. I have simply taken the 2 qualitative palettes that each have a length of 12, put them into one palette, and made sure my values were unique. This can then be passed to ggplot via scale_fill_manual(). You can always choose a vector of your own colors using this ‘R color cheatsheet’(https://www.nceas.ucsb.edu/~frazier/RSpatialGuides/colorPaletteCheatsheet.pdf).

labels <- c(T = "Tanzania", V = "Vietnam")
p <- ggplot(dat[dat$Taxa %in% keep$Taxa,] , aes(x = Taxa, y = OTUs, fill = Taxa)) + 
  geom_boxplot(outlier.colour = "red") + 
  scale_y_log10() + 
  facet_grid(~Country, labeller = labeller(Country = labels)) +
  ggtitle("Abundance of Taxa by Country") +
  xlab(NULL) +
  ylab("log(OTUs)") + 
  guides(fill=FALSE)
p

The viridis package also has some nice color palettes (https://cran.r-project.org/web/packages/viridis/vignettes/intro-to-viridis.html). I think they might all be diverging palettes (qualitative is best for our Taxa), but I will showcase a couple here.

library(RColorBrewer)
library(viridis)
Loading required package: viridisLite
display.brewer.all()

#p + scale_fill_brewer(palette = "Spectral")
#you can also use the names of colours
#scale_fill_manual(values=c("purple", "cornflowerblue", "grey", "yellow", "orange", "brown"))
palette1 <- brewer.pal(12, "Paired")
palette2 <- brewer.pal(12, "Set3")
length(unique(c(palette, palette2)))
[1] 13
custom <- c(palette1, palette2)
p + scale_fill_manual(values = custom)

p + scale_fill_viridis(discrete = TRUE)

p + scale_fill_viridis(discrete = TRUE, option = "plasma")

RSkittleBrewer is another option for funky colour palettes. ggsci has a variety of color palettes inspired by different scientific journals as well as television shows (https://cran.r-project.org/web/packages/ggsci/vignettes/ggsci.html).

Customization of Theme

As mentioned earlier, it is possible to customize every single aspect of a ggplot. Most of this occurs with a call to theme, which you can think of as modifying everything BUT your data. For example, my axis labels can be modified, but they (hopefully) have something to do with my data. However, changing the size of the text or the font of the labels is unrelated to my data, and that same structure could be carried over to other plots if I saved my own theme.

Things that you can change with theme include the axis, legend, panels, gridlines, or background.

Each element of theme inherits from one of:

ggplot comes with some themes - so I would suggest starting with the one that is closest to what you want, and start modifying from there.

Check out these themes: - theme_minimal() - theme_classic() - theme_bw() - theme_void() - theme_dark() - theme_gray() - theme_light()

You can look at the default for each theme simply by typing it into the console.

library(viridis)
p + scale_fill_viridis(discrete = TRUE)

p + scale_fill_viridis(discrete = TRUE, option = "plasma")

And this is what theme_bw() looks like:

theme_bw
function (base_size = 11, base_family = "") 
{
    theme_grey(base_size = base_size, base_family = base_family) %+replace% 
        theme(panel.background = element_rect(fill = "white", 
            colour = NA), panel.border = element_rect(fill = NA, 
            colour = "grey20"), panel.grid.major = element_line(colour = "grey92"), 
            panel.grid.minor = element_line(colour = "grey92", 
                size = 0.25), strip.background = element_rect(fill = "grey85", 
                colour = "grey20"), legend.key = element_rect(fill = "white", 
                colour = NA), complete = TRUE)
}
<environment: namespace:ggplot2>

Here is an example of theme_dark.

p + theme_bw()

There are also themes that have been created and put into a package, ggthemes. Some of these themes are based off of graphs seen in print or on websites (the economist, wall street journal, fivethirtyeight) or to match standard tools (excel, google docs). Information about the themes can be found at https://github.com/jrnold/ggthemes.

p + theme_dark()

library(ggthemes)
p + theme_economist()

You can also make your own custom theme as demoed here: http://joeystanley.com/blog/custom-themes-in-ggplot2

I am going to show you how to modify a plot, starting from theme_minimal() because I don’t like the grey backgrounds or harsh axis lines.

p + theme_stata()

Things I don’t like about this plot and their solutions:

|Problem | Solution | |_________________________________________________________:|:__________________________________________________________________________:| |x-axis labels overlap | rotate lables: axis.text.x = element_text(angle =90, hjust=1) | |country labels are smaller than axis labels | change size and face: strip.text.x = element_text(face = “bold”, size = 16)| |title is uncentered | adjust horizontally: plot.title = element_text(hjust=0.5, size = 18) | |border to separate countries | create a border: panel.border = element_rect(fill = NA) | |make y axis ticks | create y axis ticks: axis.ticks.y = element_line() |

It isn’t necessary to remember all of this syntax, I frequently find myself back at the help page at http://ggplot2.tidyverse.org/reference/theme.html.

The last call to theme will override previous calls. Therefore, if we want to start with theme_minimal() as our base, it has to be in our code BEFORE the other modifications.

p + theme_minimal()


Challenge

Change the colour of the plot background. Add minor gridlines, and make the major gridlines black. Increase the spacing in between the panels.





Correlation Plots and Heatmaps

One common way to look for outliers or trends in data is to calculate the correlation between samples and make a correlation plot. Since samples are expected to be in columns, this is an example of where you would use ‘wide’ data. The default correlation coefficient is the Pearson correlation, but others (Kendall, Spearman) can be specified. Note that the entire matrix must be numeric.

A heatmap can be made in ggplot by using the tile geom. The fill is now the correlation coefficient, which is a continuous variable (hence the gradient). I have changed the title of the legend from ‘value’ to ‘Pearson Correlation’ by using scale_fill_continuous(). The ‘’ included in the name is a line break. A fantastic resource for modifying legends is http://www.cookbook-r.com/Graphs/Legends_(ggplot2)/.

ggplot(dat[dat$Taxa %in% keep$Taxa,] , aes(x = Taxa, y = OTUs, fill = Taxa)) + 
  geom_boxplot(outlier.colour = "red") + 
  theme_minimal() +
  theme(axis.text.x = element_text(angle =90, hjust=1), 
        panel.border = element_rect(fill = NA), 
        strip.text.x = element_text(face = "bold", size = 16), 
        plot.title = element_text(hjust=0.5, size = 18), 
        axis.ticks.y = element_line()) + 
  scale_y_log10() + 
  facet_grid(~Country, labeller = labeller(Country = labels)) +
  ggtitle("Abundance of Taxa by Country") +
  xlab(NULL) +
  ylab("log(OTUs)") + 
  guides(fill=FALSE)

cdat <- read.csv("data/ENV_pitlatrine.csv", row.names = 1)
cdat <- cor(cdat)
library(reshape2)

Attaching package: ‘reshape2’

The following object is masked from ‘package:tidyr’:

    smiths

Error in names(frame) <- `*vtmp*` : names() applied to a non-vector
melt <-melt(cdat)
ggplot(melt, aes(Var1, Var2, fill = value)) + 
  geom_tile() +
  xlab(NULL) +
  ylab(NULL) +
  scale_fill_continuous(name = "Pearson\nCorrelation")

If this looks a little jarring to you compared to ‘normal’ heatmaps, it is probably because the data has not been ordered yet. This can be done by calculating the distance between samples (ie. euclidean) and them clustering them in a hierarchical way (ie. complete linkage).

To have a different colour scheme that for example, goes from red to yellow, you can use scale_fill_gradient() and specify what you want the diverging colours to be. Note that your legend title now derives from this gradient.

library(reshape2)

Attaching package: ‘reshape2’

The following object is masked from ‘package:tidyr’:

    smiths

We can now see that there is a patch of dark red samples with low correlation. It is difficult to see which sites these are as the names are a bit small.

Enter the interactive heatmap! Plotly allows for interactive plots in a variety of languages, and allows the creation of sharable links (if you are interested you have to sign up for a free account to get API credentials https://plot.ly/r/getting-started/). They have created a package, which, luckily for us, interfaces seamlessly with ggplot2. I only need to save my plot and call the function ggplotly() to have an interactive graph!

The response of plotly to manipulations other than hovering can be a bit slow if not using the API and may freeze/crash R, so I recommend saving your code before trying it.

cdat <- cdat[hc$order, hc$order]
Error: object 'hc' not found

Hovering will highlight the variables and values relating to that specific data point, and areas can be zoomed in on. We can now see the names of the low correlation samples. While they are all from Tanzania, they do not appear to all be from the same Latrine_Number or Depth.

ggvis is a Bioconductor package allowing interactive graphics in R (http://ggvis.rstudio.com/). It is founded on the grammar of graphics and the syntax is similar, but not identical to ggplot2. It is good for interactive scatterplots, bar graphs, line graphs and histograms (although not heatmaps at this point in time). The nature of the interactiveness is highly customizable: you can hover, add slider bars, checkboxes or radiobuttons (it uses Shiny’s reactive programming).

In this example I have just taken a scatterplot that we made earlier, and made it possible to hover over a point and see its underlying data. Don’t worry about the function and syntax too much - those will be introduced in later lessons. I just wanted to show you another option for interactive visualization that hopefully isn’t too intimidating.

library(plotly)

Attaching package: ‘plotly’

The following object is masked from ‘package:ggplot2’:

    last_plot

The following object is masked from ‘package:stats’:

    filter

The following object is masked from ‘package:graphics’:

    layout

Saving in ggplot2

ggplot2 has its own function for saving ggsave().

You can send the plot object to the graphics device and then save that image.

With ggsave you can minimally input the filename you would like to have, and the path to your file (you can instead combine the path and filename in the filename).

library(ggvis)

Attaching package: ‘ggvis’

The following objects are masked from ‘package:plotly’:

    add_data, hide_legend

The following object is masked from ‘package:ggplot2’:

    resolution
p <- ggvis(ndat, x= ~TS, y= ~CODt) 
  
layer_points(p) %>% add_tooltip(function(x) c(paste0("TS=", x$TS, ", CODT=", x$CODt)))
Loading required package: shiny
Showing dynamic visualisation. Press Escape/Ctrl + C to stop.

However, in some cases you want to tailor your output. You can specify the width, height and units of your image, or you can apply a scaling factor. You can also specify the plot object you want to save instead of whatever was on your graphics device last using the ‘plot’ parameter. ggsave() can save in a variety of file extensions including pdf, jpeg, tiff, bmp, svg or png.

Taking it up a notch

Multiple plots on one page (ie. for publication images)

There are a variety of methods to mix multiple graphs on the same page, however ggplot does not work well with all of them. I am going to work with a package based that uses gridExtra(which allows us to arrange plots) but works well with ggplot called ggpubr (which allows us to align the axes of our plots). For a demonstration, we are going to take 3 plots that we made earlier (a boxplot, a histogram, and a bubble plot), save them as objects, and then arrange and align them in the same figure. (http://www.sthda.com/english/rpkgs/ggpubr/)

There is a function ggarrange() which takes your plots, their labels, and how they are arranged in rows and columns. To start, if we just have our boxplot and bubble plot, and we wanted them to be side by side, if you picture each plot as a square in a grid, we would have two columns ncol = 2 and 1 row

While our grid areas are the same size, our plot backgrounds are not. Let’s adjust the legend so that it is in the top right corner of the plot, and get remove the white background. We do not need the alpha legend. Do you remember how to get rid of it?

ggarrange(bub, hist, 
          labels = c("A", "B"),
          ncol = 2, nrow = 1)
Removed 158 rows containing non-finite values (stat_bin).Removed 3 rows containing missing values (geom_bar).

Imagine a square with 4 boxes. We are going to put our boxplot in the left top and bottom boxes, the histogram in the top right box, and the bubble plot in the bottom right box. To do this, we are arranging 2 columns (one with the boxplot and one with the histogram + boxplot) and we are arranging 2 rows (one with the histogram and one with the boxplot).

ggarrange(bub, hist, 
          labels = c("A", "B"),
          ncol = 2, nrow = 1)
Removed 158 rows containing non-finite values (stat_bin).Removed 3 rows containing missing values (geom_bar).

The y-axis lines for the histogram and boxplot are not aligned, but this can be fixed with a call to align="v".

ggarrange(box, ggarrange(hist, bub, 
          labels = c("B", "C"),
          nrow = 2),
           ncol = 2, labels = "A")
Removed 158 rows containing non-finite values (stat_bin).Removed 3 rows containing missing values (geom_bar).Transformation introduced infinite values in continuous y-axisTransformation introduced infinite values in continuous y-axisRemoved 7 rows containing non-finite values (stat_boxplot).

To make sure all axis titles are the same size (B and C look a little off to me), we can specify the font we want changed and the size we want to change it to. I am also going to make the legend title size the same. I want to make the bubble plot font a little smaller as well. Let’s look at the font function. Can I do that? What might be another option?

ggarrange(box, ggarrange(hist, bub, 
          labels = c("B", "C"),
          nrow = 2, align = "v"),
           ncol = 2, labels = "A")
Removed 158 rows containing non-finite values (stat_bin).Removed 3 rows containing missing values (geom_bar).Transformation introduced infinite values in continuous y-axisTransformation introduced infinite values in continuous y-axisRemoved 7 rows containing non-finite values (stat_boxplot).


Challenge

Using the code available here for a line graph we made earlier in the lesson, make a combined figure. Imagine a square with 4 boxes. We are going to put our line graph in the top right and left boxes, the histogram in the bottom left box, and the bubble plot in the bottom right box. Make sure the legend and axis titles are the same size. Change the legend text for the line graph to be smaller than the legend title.

lin <- ggplot(errdat, aes(x=Temp, y=pH, colour = Country)) + 
  geom_line() + 
  geom_errorbar(aes(ymin=pH-sd_pH, ymax = pH+sd_pH), width = .2, alpha = 0.4)





There are many fantastic R packages to analyze and visualize your data. As a group, we are likely working on a variety of specialized areas. The plots we have made so far today should be useful for data exploration for many different kinds of data. In the next section we are going to preview some more complex visualization types, but since these take more time to go through and not everyone may be interested in network diagrams, time-series analysis, or geospatial data, we will not be plotting all of these ourselves. If you are interested in a tutorial on one of these visualization types, please indicate that in the comments area of the lesson survey (https://www.surveymonkey.com/r/VNQZ3KS).

We will, however do an Upset Plot as my personal crusade to have something better than venn diagrams in publications.

Interactive graphics

Plotly: - https://plot.ly/r/

ggvis: - http://ggvis.rstudio.com/interactivity.html

Heatmaps: - https://github.com/talgalili/heatmaply

Interactive time-series data: - https://rstudio.github.io/dygraphs/ ###Network diagrams visNetwork (based on igraph) https://datastorm-open.github.io/visNetwork/edges.html

Upset Plots

Upset plots are an alternative to venn diagrams that show the intersection of sets, as well as the size of the sets. Additionally, venn diagrams can be difficult to interpret for greater than 2 or 3 sets.

Let’s see how this works practically. The question that we are asking is how many sites are each Taxa represented at and what is the overlap with other Taxa? The data that we have is an OTU table. For this purpose we simply want to know whether a given Taxa was present or absent in the sample, and then we can form intersections based on this information. So, for each OTU value that is not 0, we replace it with a 1 instead. Then all we have to do is to enter our data set, the number of sets we are inputting, if we want to order the results (in this case by frequency), and how many intersections we want to show. I have picked 8 different Taxa, which would make a crazy venn diagram with a lot of zeros in it since a couple of these Taxa are rare. Here, I will show 15 intersections - we know the remaining intersections would be zero since this is ordered by frequency.

ggarrange(lin+ font("axis.title", size=9) + font("legend.title", size=9) + font("legend.text", size = 8), ggarrange(hist + font("axis.title", size=9) + font("legend.title", size=9), bub + font("axis.title", size=9), 
          labels = c("B", "C"),
          ncol = 2, align = "h"),
           nrow = 2, labels = "A")
Removed 158 rows containing non-finite values (stat_bin).Removed 3 rows containing missing values (geom_bar).Removed 4 rows containing missing values (geom_errorbar).

Let’s look at our greatest Intersection Size (equal to 28). This means that 28 of our 81 samples have Clostridia, Alphaproteobacteria, and Deinococci present in the same sample WITHOUT Chloroflexi, both Acidobacteria, Cyanobacteria or Lentisphaeria also being present. We can see from the Set Size that Clostridia and Alphaproteobacteria are present in almost all samples, and Deinococci is present in greater than 3/4 of the samples. You might expect this overlap to be higher, but remember that this is without the other Taxa and moving along our interesection sizes and dot matrix, we can see that other intersections include these bacteria. Some quick mental math shows that 63 samples have these 3 bacteria present.

Genomics Data

ggbio: - http://www.bioconductor.org/packages/2.11/bioc/vignettes/ggbio/inst/doc/ggbio.pdf

GenVisR: - https://bioconductor.org/packages/release/bioc/vignettes/GenVisR/inst/doc/GenVisR_intro.html

GenomeGraphs: - https://bmcbioinformatics.biomedcentral.com/articles/10.1186/1471-2105-10-2


Challenge

Install and load the gapminder package. Save the gapminder data into an object using dat <- gapminder. Plot the following:

  1. a qualitative variable with a quantitative variable (at least 2 ways).
    • Facet your plot by the qualitative variable.
  2. two quantitative variables (at least 2 ways).
    • Add a regression line to your plot.





Resources:

Wickham, Hadley. (2010). A Layered Grammar of Graphics. Journal of Computational and Statistical Graphics.
Wilkinson, L. (2005), The Grammar of Graphics (2nd ed.). Statistics and Computing, New York: Springer. [14, 18]
http://www.cookbook-r.com/Graphs/
https://github.com/jennybc/ggplot2-tutorial
Tufte, Edward R. The Visual Display of Quantitative Information.
http://stcorp.nl/R_course/tutorial_ggplot2.html
http://ggplot2.tidyverse.org/reference/theme.html http://joeystanley.com/blog/custom-themes-in-ggplot2 https://github.com/jrnold/ggthemes https://www.nceas.ucsb.edu/~frazier/RSpatialGuides/colorPaletteCheatsheet.pdf https://cran.r-project.org/web/packages/ggrepel/vignettes/ggrepel.html http://www.cookbook-r.com/Graphs/Legends_(ggplot2)/ https://github.com/eclarke/ggbeeswarm https://cran.r-project.org/web/packages/ggsci/vignettes/ggsci.html http://www.sthda.com/english/rpkgs/ggpubr/ https://rpubs.com/drsong/9575 http://elpub.bib.uni-wuppertal.de/edocs/dokumente/fbb/wirtschaftswissenschaft/sdp/sdp15/sdp15006.pdf http://www.sthda.com/english/articles/24-ggpubr-publication-ready-plots/81-ggplot2-easy-way-to-mix-multiple-graphs-on-the-same-page/

Post-Lesson Assessment


Your feedback is essential to help the next cohort of trainees. Please take a minute to complete the following short survey: https://www.surveymonkey.com/r/VNQZ3KS




Thanks for coming!!!

---
title: "Lesson 3 - Plot all the things! From Data Exploration to Publication-Quality Figures"
output: 
  html_document:
          keep_md: yes
          toc: TRUE
          toc_depth: 4
  html_notebook:
          toc: TRUE
          toc_depth: 4
---
***
</br>

##A quick intro to the intro to R Lesson Series

</br>

This 'Intro to R Lesson Series' is brought to you by the Centre for the Analysis of Genome Evolution & Function's (CAGEF) bioinformatics training initiative. This course was developed based on feedback on the needs and interests of the Department of Cell & Systems Biology and the Department of Ecology and Evolutionary Biology. 



This lesson is the third in a 6-part series. The idea is that at the end of the series, you will be able to import and manipulate your data, make exploratory plots, perform some basic statistical tests, test a regression model, and make some even prettier plots and documents to share your results. 


![](img/data-science-explore.png)

</br>

How do we get there? Today we are going to be learning how to make all sorts of plots - from simple data exploration to interactive plots.The next lesson will be data cleaning and string manipulation; this is really the battleground of coding - getting your data into the format where you can analyse it. Then we will learn how to do t-tests and perfrom regression and modeling in R. And lastly, we will learn to write some functions, which really can save you time and help scale up your analyses.


![](img/spotify-howtobuildmvp.gif)

</br>

The structure of the class is a code-along style. It is hands on. The lecture AND code we are going through are available on GitHub for download at https://github.com/eacton/CAGEF __(Note: repo is private until approved)__, so you can spend the time coding and not taking notes. As we go along, there will be some challenge questions and multiple choice questions on Socrative. At the end of the class if you could please fill out a post-lesson survey (https://www.surveymonkey.com/r/VNQZ3KS), it will help me further develop this course and would be greatly appreciated. 

</br>


***
__Objective:__ At the end of this session you will be able to use ggplot2() to make a ton of different types of plots with your data for both for data exploration and for publication-quality figures.  
***

##Intro to the Grammar of Graphics

**Grammar of graphics:** a language to be able to communicate about what we are plotting programatically 

![(Figure borrowed from Paul Hiemstra)](img/ggplot2.png)

* The *aesthetics* are what you can see. For example, the data being plotted or represented by a shape or colour.
This data could be presented in multiple ways. Data is mapped to aesthetics. A plot can have multiple layers (for example,
a scatterplot with a regression line).

* Lines, bars, and points are examples of *geometric objects (geoms)* that could be used to present the data.

* The data units need to be converted to physical units in order to be displayed. This uses *scaling* and a *coordinate system* (position adjustment for where is our pixel going). Other *statistical transformations* can also occur, such as aggregating data, jittering, density estimates, a boxplot and binning. 

* *Facetting* allows us to plot subsets of the data.

This grammar allows very specific control over your plot and the ability to change features (easily). Plots can be scaled and made
pretty enough for publication quality images.

ggplot2 was made to interact well with tidy (long) datasets. I have found that if someone is spending lots of time figuring out how to make a scatterplot, his or her data is probably not in the correct format for ggplot2.


##The Grammar of Graphics in Action: Making Figures with ggplot2

###Our Dataset

Metagenomic 16SrRNA sequencing of latrines from Tanzania and Vietnam at different depths (multiples of 20cm). 

We have 2 csv files (change one to tsv or xlsx - maybe both and make an additional files and get a google spreadsheet): 
1. A metadata file (Naming conventions: [Country_LatrineNo_Depth]) with sample names and environmental variables.
2. A table of species abundance.

B Torondel, JHJ Ensink, O Gunvirusdu, UZ Ijaz, J Parkhill, F Abdelahi, V-A Nguyen, S Sudgen, W Gibson, AW Walker, and C Quince.
Assessment of the influence of intrinsic environmental and geographical factors on the bacterial ecology of pit latrines
Microbial Biotechnology, 9(2):209-223, 2016. DOI:10.1111/1751-7915.12334

***

###Scatterplot Example

I have saved a version of the OTU table in tidy format (which we created in Lesson 2 - split_dat). As well as a modified version of the metadata table.

```{r include = FALSE}
#ndat <- read.csv("data/ENV_pitlatrine.csv", stringsAsFactors = F)
#ndat <- ndat %>% separate(Samples, c("Country", "Latrine_Number", "Depth"), sep = "_")
#ndat$Country <- gsub("T", "Tanzania", ndat$Country)
#ndat$Country <- gsub("V", "Vietnam", ndat$Country)

#write.csv(ndat, "data/split_ENV_pitlatrine.csv", row.names=FALSE)

ndat <- read.csv("data/split_ENV_pitlatrine.csv")
```


Let's read them in and also load ggplot2.


```{r, suppressPackageStartupMessages()}
library(tidyverse)

dat <- read.csv("data/long_SPE_pitlatrine.csv", stringsAsFactors = F, header = TRUE)

```


Let's build a plot by adding components one by one to see how the grammar of graphics is implemented in ggplot2. To start, we of course need to input our data. However, if that is all we input, what we get back is a blank graphic. We have not yet said what we want to plot and how we want to plot it. 

```{r}
ggplot(ndat)
```

We have, however, created a ggplot object. This is a list of 9 parameters: data, layers, scales, mapping, theme, coordinates, facet, plot environment, and labels. Luckily there are some defaults, so we don't have to specify everything, but you can start to see how ggplot objects are highly customizable. 

```{r}
str(ggplot(ndat))
```

The next step is to choose the data we are plotting (aesthetics). At this point the data can be scaled and the axes appear. We have not yet specified how we want the data plot, just which data should be plotted. In practice, people usually omit 'mapping = ', but it is a good reminder that mapping is, in fact, what we are doing.

```{r}
ggplot(ndat, mapping = aes(x = TS, y = CODt))
```

We now have chosen the geometric object (geom) with which to plot our data, in this case a point. A geom could be a line, a bar, a boxplot - you can type geom_ and then Tab to see all of the available options. Autocomplete can also be helpful for remembering syntax.

```{r}
ggplot(ndat, aes(x=TS, y=CODt)) + geom_point() 
```

The data looks like there are two groupings. My guess would be that this is for the 2 different countries. We can easily test this by colouring our points by Country. A legend will be automatically created for you. I have also changed the shape and size of the geom. A quick reference key for shapes can be found in the 'Cookbook for R' (http://www.cookbook-r.com/Graphs/Shapes_and_line_types/).

```{r}
ggplot(ndat, aes(x=TS, y=CODt, colour = Country)) + geom_point(shape = 18, size = 5) 
```

Some of our data points seem to be crushed near the x-axis. We can scale the y-axis to fix this. When we start customizing our plot, it easier to read our code if we have each specification on a new line. 


```{r}
ggplot(ndat, aes(x=TS, y=CODt, colour = Country)) + 
  geom_point(alpha = 0.3) + 
  scale_y_log10()
```

####Faceting

Faceting allows us to split our data into groups. Note that I have removed the colour. It is good data visualization practice to only have one attribute (colour, shading, faceting, symbols) per grouping.

```{r}
ggplot(ndat, aes(x=TS, y=CODt)) + 
  geom_point() +
  scale_y_log10() +
  facet_grid(~Country)
```

I could now add information from another variable as a colour in this plot. Note that if a variable is continuous instead of discrete, the colour will be a gradient.

```{r}
ggplot(ndat, aes(x=TS, y=CODt, colour = Temp)) + 
  geom_point() +
  scale_y_log10() +
  facet_grid(~Country)
```

If we really wanted to we could add another variable to the plot by changing the shape attribute. Note that shape can only be used for discrete values. Let's change Country to having 2 shapes and facet by the Depth of the latrine instead.

```{r}
ggplot(ndat, aes(x=TS, y=CODt, colour = Temp, shape = Country)) + 
  geom_point() +
  scale_y_log10() +
  facet_wrap(~Depth, scales = "free_y")

```

We can now note that only Tanzania had sampling greater than 4cm of depth. There are single latrines for 4 samples. There was no latrine at a depth of 11cm. Lack of replication and a bias towards Tanzania for the higher depths is something we should keep in mind while looking at this data. Depending on the question we are trying to answer, we may want to remove some of this data. We can also note that our numbers are being ordered as if they were characters. If you recall we created Depth by splitting a sample name. We can change Depth to numeric type, and the numbers will be ordered properly.

```{r}
ndat$Depth <- as.numeric(ndat$Depth)

ggplot(ndat, aes(x=TS, y=CODt, colour = Temp, shape = Country)) + 
  geom_point() +
  scale_y_log10() +
  facet_wrap(~Depth)
```

One thing that is not necessary in this case - but good to know about - is the ability to allow each grid to have its own independent axis scale. In our example, wells of Depths from 1-4cm have up to 1000 CODt, while the other wells barely have values past 100 CODt. This can be changed, but keep in mind most people will assume all grids have the same scale, so take extra care to point that the scales are different when presenting or publishing. 

```{r}
ggplot(ndat, aes(x=TS, y=CODt, colour = Temp, shape = Country)) + 
  geom_point() +
  scale_y_log10() +
  facet_wrap(~Depth, scales = "free_y")

```

####Adding Regression Lines
You can also add statistical transformations to your plots. Again, take a look at stat_ then Tab to see the list of options. In this case let's separately fit a linear regression line to CODt vs TS for each country. The grey area around the line is the confidence interval (default=0.95) and can be removed with the additional call to stat_smooth of 'se = FALSE'.

```{r}
ggplot(ndat, aes(x=TS, y=CODt)) + 
  geom_point() +
  scale_y_log10() +
  facet_grid(~Country) +
  stat_smooth(method = lm)
```

A linear model is not necessarily the best fit. The method of calculating the smoothing function can be changed to other provided functions or can be a custom formula. Note that I changed the confidence interval by modifying 'level=0.8'. geoms can be made more transparent with the alpha parameter, which is set to 0.3 in the following code so that the emphasis is on the regression line rather than the points.

```{r}
ggplot(ndat, aes(x=TS, y=CODt)) + 
  geom_point(alpha = 0.3) +
  scale_y_log10() +
  facet_grid(~Country) +
  stat_smooth(method = loess, level = 0.8)
```

##Exploring different types of plots

###Density Plots

We are making a density plot for OTUs by Country. I have set alpha (transparency) to 0.3 so that we can see both countries on our plot.

```{r}
ggplot(dat, aes(x=OTUs, fill=Country, alpha=0.3)) + 
	geom_density() 
```

The first thing to notice is that everything is clumped at 0. This is because we have not filtered our data frame to remove all observations where OTUs are zero. Here we filter to have at least 2 OTUs. The other thing to notice is that there is a long tail where there will only be a few observations. It will be necessary to change the x-axis to see our data. Note that R gives us a warning that we are not viewing 158 of our 4212 rows. We can add a rug geom to see each value.

```{r }
ggplot(dat[dat$OTU >=2,], aes(x=OTUs, fill=Country, alpha=0.3)) + 
	geom_density() +
  geom_rug() +
  xlim(0, 1000)

```

###Histograms

Histograms instead count the number of observations you have in each 'bin' that you specify. The default binwidth is 30. THIS HAS NOTHING TO DO WITH YOUR DATA!! CHANGE IT!! R will even warn you to change your binwidth.

```{r}
ggplot(dat[dat$OTU >=2,], aes(x=OTUs, fill=Country, alpha=0.3)) + 
	geom_histogram() +
  xlim(0,1000)
```

Instead of having the countries information stacked, we may want to see the data side by side. This can be done with the parameter 'position' set to "dodge". A rug geom can also be added to a histogram.

```{r warning = FALSE}
ggplot(dat[dat$OTU >=2,], aes(x=OTUs, fill=Country, alpha=0.3)) + 
	geom_histogram(binwidth = 50, position = "dodge") +
  xlim(0,1000) +
  ylim(0,150) +
  geom_rug()
```

***
__Challenge__ 


<div style="float:left;margin:0 10px 10px 0" markdown="1">
![](img/Pug-vs-Food-1.jpg){width=150px}

</div>

Using the data filtered for OTUs greater than or equal to two, make a violin plot of the distribution of OTUs for each country. Use a log scale. Colour the violin plots by Country. Draw lines across the violin plot where the quantiles (25th, 50th, 75th) are for each plot. What do the widths of the plots represent?


</br>
</br>
</br>

***



```{r include = FALSE, eval = FALSE}
ggplot(dat[dat$OTU >=2,], aes(x=Country, y = OTUs, fill = Country)) + 
  scale_y_log10() +
	geom_violin(draw_quantiles = c(0.25, 0.5, 0.75)) 

```




###Bar plots

Let's create a bar plot of OTUs per country and use the filled in colour to represent Taxa. 

```{r error = TRUE}
ggplot(dat, aes(x=Country, y=OTUs, fill=Taxa)) + 
	geom_bar() 
```



This is a common error because the default for geom_bar is to use the y-axis for a count. To use it for a variable instead, we have to specify stat="identity".

```{r}
ggplot(dat[dat$OTU >=2,], aes(x=Country, y=OTUs, fill=Taxa)) + 
	geom_bar(stat = "identity") 
```

####Controlling the order of your categorical variables in your legend

In the legend for the diagram above our factor levels for Taxa are in alphabetical order. However, an ordering that might be more useful would be the order that matches our data, for example, Taxa in descending order of OTUs. To do this, we can use the `forcats` package. `fct_reorder2()` takes the factor we want to order by (Taxa) and orders it by the values given (Country, OTUs). We will talk about customizing plots later in this lesson, but the last line of code is one way to remove the legend title (by making the legend title 'blank').

```{r}
library(forcats)

ggplot(dat[dat$OTU >=2,], aes(x=Country, y=OTUs, fill=fct_reorder2(Taxa, Country, OTUs))) + 
	geom_bar(stat = "identity") +
  theme(legend.title = element_blank())
 
```

Now our highest abundance Taxa is the first value in our legend, and this matches the order of the Taxa in our bar graph. The legend order now has meaning. The white line in the Vietnam bar graph is a Taxa for which there was an OTU value in Tanzania but no data in Vietnam (since anything <=2 was filtered out of the data set). 




Let's take a second to look at what happens when stat does not equal 'identity'.

```{r}
ggplot(dat, aes(x=Country, fill=Taxa)) + 
	geom_bar() 

```

What is being counted instead of OTUs? Why does the colouring for Taxa look so different?


You can alternate between stacked or "dodged" for whether your bars are on top of each other or next to each other (split by a factor or categorical variable) or on top.

```{r}
ggplot(ndat, aes(x=Depth, y=pH, fill = Country)) + 
	geom_bar(stat="identity", position = "dodge") 
```

You can have your bars horizonally instead of verstically by using coord_flip()l.

```{r}
ggplot(ndat, aes(x=Depth, y=pH, fill = Country)) + 
	geom_bar(stat="identity", position = "dodge") +
  coord_flip()

```

###Circular Plots

In ggplot, circular plots are related to bar graphs - they just have different coordinate systems. The default coordinate system is cartesian coordinates, and we need to switch to polar coordinates to make a circle. This is a Coxcomb plot - pH levels increase as you move up the outer rings, and depth increases as you move clockwise around the circle. Colour is still represented by country. 

```{r}
ggplot(ndat, aes(x=Depth, y=pH, fill = Country)) + 
	geom_bar(stat="identity", position = "dodge") +
  coord_polar()
```

To make your classic venn diagram, use theta to specify what variable is going to be used to make up the angles (width of pie slices). To wrap to a full circle instead of having sections, the width is set to one.

```{r}
ggplot(dat, aes(x="", y=Country, fill = Country)) + 
	geom_bar(stat = "identity", width = 1) +
  coord_polar(theta = "y")

```



###Lines 

To draw a line graph, we select geom_line(). 

```{r}
ggplot(ndat, aes(x=Temp, y=pH)) + geom_line() 
```

We can colour the lines for the different countries. Looking at our previous graph, what must have happened when these values were over the same range?

```{r}
ggplot(ndat, aes(x=Temp, y=pH, colour = Country)) + geom_line() 
```

####Error bars

To plot error bars, we need to give the geom, geom_errorbar(), the interval over which we want to draw the bar. This is usually something like one standard deviation or one standard error or a confidence interval. Luckily, we have learned how to calculate the mean and standard deviation with dplyr. Then, all we need to do is add or subtract the standard deviation from our data point to get the bounds of the error bar. geom_errorbar() takes these bounds passed to the parameters ymax and ymin. In this case, since there were no sample replicates, the standard deviation is taken from the mean of all of the soil readings in a given country at a given depth.

```{r}
errdat <- ndat %>% group_by(Country, Depth) %>% mutate(mean_pH = mean(pH)) %>% mutate(sd_pH= sd(pH))

ggplot(errdat, aes(x=Temp, y=pH, colour = Country)) + 
  geom_line() + 
  geom_errorbar(aes(ymin=pH-sd_pH, ymax = pH+sd_pH), width = .2, alpha = 0.4)
```




###Stripplots

The first plots we made we scatterplots with 2 continuous variables. With one discrete variable and one categorical variable, we can make a stripplot.


```{r}
ggplot(dat, aes(x = Depth, y = OTUs)) + geom_point() 
```

Again, we have a lot of values crushed near the x-axis. It we add log scaling to the y-axis, we get an error that this transformation created infinite values. This is because we have zeros in our data set. Knowing this, we could ignore the warning, or add +1 to each OTU.

```{r}
ggplot(dat, aes(x = Depth, y = OTUs)) + geom_point() + scale_y_log10() 
```

```{r}
ggplot(dat, aes(x = Depth, y = (OTUs+1))) + geom_point() + scale_y_log10() 
```

In order to see the points a little better, we can 'jitter' them.

```{r}
ggplot(dat, aes(x = Depth, y = (OTUs+1))) + geom_point(position = "jitter") + scale_y_log10() + facet_wrap(~Country)
```


###Bubble Plots + Labels

Let's group our OTUs by Country and Taxa and look at the Taxa for the top 30 OTUs.

With bubble plots, we need to provide the size of our bubbles in a way that is proportional to our data, and provide a scale to map these to. 

We are going to make our points with geom_jitter so that our bubbles don't overlap. Remember that we want to specify what we are plotting with aesthetics. We want the bubbles representing OTUs_per_Taxa, so we divide each value by pi when calling size since bubbles are circles.

```{r} 
bdat <- dat %>% select(-Latrine_Number, -Depth) %>% group_by(Taxa, Country) %>% mutate(OTUs_per_Taxa = sum(OTUs)) %>% select(-OTUs)

bdat <- unique(bdat) %>% arrange(desc(OTUs_per_Taxa)) %>% .[1:30,]

ggplot(bdat, aes(x = Country, y = OTUs_per_Taxa, fill=Taxa)) +
  scale_y_log10() +
  geom_jitter(aes(size = OTUs_per_Taxa/pi), pch = 21, show.legend = TRUE)  +
  guides(fill = FALSE)
```

However, it is hard to tell proportionally how different these sizes are. Mapping values to a scale gives us more of an idea of their relative sizes.

```{r}
ggplot(bdat, aes(x = Country, y = OTUs_per_Taxa, fill=Taxa)) +
  scale_y_log10() +
  geom_jitter(aes(size = OTUs_per_Taxa/pi), pch = 21, show.legend = FALSE) + 
  scale_size_continuous(range=c(1,30)) +
  guides(fill=FALSE)
```

`ggplot2` provides 2 methods of labelling. Let's look at the help menu to find out what the difference between the 2 of them is. We have to specify in with aesthetics, that we are plotting our label. 

```{r}
ggplot(bdat, aes(x = Country, y = OTUs_per_Taxa, fill=Taxa)) +
  scale_y_log10() +
  geom_jitter(aes(size = OTUs_per_Taxa/pi), pch = 21, show.legend = FALSE) + 
  scale_size_continuous(range=c(1,30)) +
  geom_label(aes(label = Taxa), show.legend = FALSE)
```

This isn't fantastic because our labels are overlapping. What parameters might you be able to use to move the labels? Try using them to get the labels to not overlap.

geom_text has an option to check for overlap of labels. geom_text does not provide a background for the label, and it may be harder to tell which label belongs to each data point.

```{r}
ggplot(bdat, aes(x = Country, y = OTUs_per_Taxa, fill=Taxa)) +
  scale_y_log10() +
  geom_jitter(aes(size = sqrt(OTUs_per_Taxa/pi)), pch = 21, show.legend = FALSE) + 
  scale_size_continuous(range=c(1,30)) +
  geom_text(aes(label = Taxa), check_overlap = TRUE, show.legend = FALSE)
```

What happened here? Why don't we have as many labels? Look in the help menu to explain this behaviour.

I don't like futzing with label positions, so I went looking for a package that would do this for me. `ggrepel()` will 'repel' your labels away from each other without getting rid of them. Let's check it out with our bubble plot labels. Install and load ggrepel. The equivalent function we can use is `geom_label_repel()`. The force parameter allow you to modify how far you want your labels pushed away from each other. Here, colour is being used to map to our data points but ggrepel can ditch the box and connect a line to data points instead. Variations on use can be found here: https://cran.r-project.org/web/packages/ggrepel/vignettes/ggrepel.html.

```{r}
library(ggrepel)

ggplot(bdat, aes(x = Country, y = OTUs_per_Taxa, fill=Taxa)) +
  scale_y_log10() +
  geom_jitter(aes(size = sqrt(OTUs_per_Taxa/pi)), pch = 21, show.legend = FALSE) + 
  scale_size_continuous(range=c(1,30)) +
  geom_label_repel(aes(label = Taxa), force = 2,show.legend = FALSE, segment.size=0)
```

###Boxplots

Boxplots are a great way to visualize summary statistics for your data. As a reminder, the thick line in the center of the box is the median. The upper and lower ends of your box are the first and third quartiles or 25th and 75th percentiles of your data. The whiskers extend to the largest value no further than 1.5*IQR (inter-quartile range - the distance between the first and third quartiles). Data beyond these whiskers are considered outliers and plotted as individual points. This is a quick way to see how comparable your samples or variables are.

We are going to see boxplots for the distribution of OTUs per Taxa across all samples.

```{r}
ggplot(dat, aes(x = Taxa, y = OTUs)) + geom_boxplot() + theme(axis.text.x = element_text(angle=90, hjust=1))
```

While it is clear that Clostridia is the most represented taxa, it is difficult to tell whether some other taxa have no representation, or if they are lowly represented. Transforming to a log scale on the y-axis will sort this out for us.

```{r}
ggplot(dat, aes(x = Taxa, y = OTUs)) + geom_boxplot() + theme(axis.text.x = element_text(angle=90, hjust=1)) + scale_y_log10()
```

We could also clean this plot up a bit by removing those taxa with 1 or fewer OTUs. And have the condition that it must occur in more than 1 sample.

```{r}

keep <- dat %>% ungroup() %>% filter(OTUs>=2) %>% group_by(Taxa) %>% summarize(n = n()) %>% filter(n > 1) %>% select(Taxa)


ggplot(dat[dat$Taxa %in% keep$Taxa,] , aes(x = Taxa, y = OTUs)) + geom_boxplot() + theme(axis.text.x = element_text(angle=90, hjust=1)) + scale_y_log10()
```

If we facet by country we can see that Acidobacteria_Gp1, for example, was only found in Vietnam and not Tanzania.

```{r warning = FALSE}
ggplot(dat[dat$Taxa %in% keep$Taxa,] , aes(x = Taxa, y = OTUs)) + 
  geom_boxplot() + 
  theme(axis.text.x = element_text(angle=90, hjust=1)) + 
  scale_y_log10() + facet_grid(~Country)
```

For now, let's just keep the Taxa that are common to both countries.

```{r warning = FALSE}
keep <- dat %>% 
  ungroup() %>% 
  filter(OTUs>=2) %>% 
  group_by(Country, Taxa) %>% 
  summarize(n = n()) %>% 
  filter(n > 1) 

keep$dup <- keep %>% ungroup() %>% select(Taxa) %>% duplicated(.)

keep <- keep[keep$dup, "Taxa"]

ggplot(dat[dat$Taxa %in% keep$Taxa,] , aes(x = Taxa, y = OTUs)) + geom_boxplot() + theme(axis.text.x = element_text(angle=90, hjust=1)) + scale_y_log10() + facet_grid(~Country)

```

###Beeswarm Plots

Even though boxplots give us summary statistics on our data, it is useful to be able to see where our data points are. Adding the data using `geom_point()` adds our data in the form of a stripplot to our boxplots. It would be nice to see how many points we have at each OTU value, but there are a lot of points overlapping here.

```{r warning = FALSE}
ggplot(dat[(dat$Taxa=="Clostridia" | dat$Taxa == "Gammaproteobacteria" | dat$Taxa == "Unknown" | dat$Taxa == "Bacilli") & dat$Country == "T",] , aes(x = Taxa, y = OTUs)) + 
  geom_boxplot() + 
  geom_point() +
  theme(axis.text.x = element_text(angle=90, hjust=1)) + 
  scale_y_log10()
```

Jittering the points gives a general idea, but is a bit too widely dispursed to give a good sense of the data.

```{r warning = FALSE}
ggplot(dat[(dat$Taxa=="Clostridia" | dat$Taxa == "Gammaproteobacteria" | dat$Taxa == "Unknown" | dat$Taxa == "Bacilli") & dat$Country == "T",] , aes(x = Taxa, y = OTUs)) + 
  geom_boxplot() + 
  geom_jitter() +
  theme(axis.text.x = element_text(angle=90, hjust=1)) + 
  scale_y_log10()
```

A beeswarm plot places data points that were overlapping instead next to each other, so we can get a better picture of the distribution of points. We simply overlay the points with `geom_beeswarm()`.

```{r warning = FALSE}
library(ggbeeswarm)

ggplot(dat[(dat$Taxa=="Clostridia" | dat$Taxa == "Gammaproteobacteria" | dat$Taxa == "Unknown" | dat$Taxa == "Bacilli") & dat$Country == "T",] , aes(x = Taxa, y = OTUs)) + 
  geom_boxplot() + 
  geom_beeswarm() +
  theme(axis.text.x = element_text(angle=90, hjust=1)) + 
  scale_y_log10()
```


Increasing the spacing between data points can make this distribution a bit clearer.

```{r warning = FALSE}
ggplot(dat[(dat$Taxa=="Clostridia" | dat$Taxa == "Gammaproteobacteria" | dat$Taxa == "Unknown" | dat$Taxa == "Bacilli") & dat$Country == "T",] , aes(x = Taxa, y = OTUs)) + 
  geom_boxplot() + 
  geom_beeswarm(cex = 2.2) +
  theme(axis.text.x = element_text(angle=90, hjust=1)) + 
  scale_y_log10()
```

Using `geom_quasirandom()` gives the empirical distribution of the stripplot to avoid overplotting.

```{r warning = FALSE}
ggplot(dat[(dat$Taxa=="Clostridia" | dat$Taxa == "Gammaproteobacteria" | dat$Taxa == "Unknown" | dat$Taxa == "Bacilli") & dat$Country == "T",] , aes(x = Taxa, y = OTUs)) + 
  geom_boxplot(alpha=0.8) + 
  geom_quasirandom(varwidth = TRUE) +
  theme(axis.text.x = element_text(angle=90, hjust=1)) + 
  scale_y_log10()
```


Other spacing and distribution options are available at https://github.com/eclarke/ggbeeswarm.



##Customization of Plot Elements Related to Your Data

Plot elements relating to my data include things like axis labels, titles, colour or shapes that represent subsets of your data, scaling that is data-dependent, legends, and other data-driven parameters. 

For customizing you data it is possible to change:
- colour()
- fill()
- shape()
- size()
- alpha()

We have seen in the above examples that colour can be applied to discrete or continous variables. We can also use colour (shape, etc.) to represent outliers. In this data set outliers beyond the whiskers (above or below 1.5*IQR) can be coloured red.

Titles and axis labels can be added using:
- ggtitle()
- xlab()
- ylab()

In this case, I want to add a title, modify the y-axis label to say that the data is log, and remove the x-axis label because I think it is self-explanatory and I will mention Taxa in my title. Note that I remove the x-axis label by using 'NULL'.

I also want to change my Country labels from a single letter to the country name. This can be done in a couple of ways. One way would be to change the values in the dataset. Since we haven't learned string manipulation yet, I will show you a second way, which is to use the labeller function. I can make a vector of the names that I want to replace. The data is split by Country in the `facet_grid()` and this is where we pass our labels to the labeller function, which will output the names on the strip label.


```{r warning = FALSE}
labels <- c(T = "Tanzania", V = "Vietnam")

p <- ggplot(dat[dat$Taxa %in% keep$Taxa,] , aes(x = Taxa, y = OTUs, fill = Taxa)) + 
  geom_boxplot(outlier.colour = "red") + 
  scale_y_log10() + 
  facet_grid(~Country, labeller = labeller(Country = labels)) +
  ggtitle("Abundance of Taxa by Country") +
  xlab(NULL) +
  ylab("log(OTUs)") + 
  guides(fill=FALSE)

p
```


A common thing to want to do is to change colours from `ggplot2`'s rainbow color scheme. Let's create our own colour palette for Taxa.

##A Note on Colour Palettes

There are 3 main types of colour palettes.
1. Sequential - implies an order to your data ie. light to dark implies low values to high values.
2. Diverging - low and high values are extremes, and the middle values are important still goes from light to dark, but 3 colour mainly used.
3. Qualitative - there is no quantitative relationship between colours. This is usually used for categorical data.

Which of these is `ggplot2`'s default color palette?

Many colour palettes now exist. I showcase a couple that work nicely with `ggplot2`. These packages also have colorblind friendly options. RColorBrew has options for these 3 types of palettes, which you can see with `display.brewer.all()`. With a smaller dataset, we could make a call to `scale_fill_brewer()`, which just requires a choice of one of RColorBrewer's palettes, such as "Spectral". However, we have 24 Taxa and these palettes have 8-12 colours, so we have to get creative. I have simply taken the 2 qualitative palettes that each have a length of 12, put them into one palette, and made sure my values were unique. This can then be passed to ggplot via `scale_fill_manual()`. You can always choose a vector of your own colors using this 'R color cheatsheet'(https://www.nceas.ucsb.edu/~frazier/RSpatialGuides/colorPaletteCheatsheet.pdf).


```{r warning = FALSE}
library(RColorBrewer)
library(viridis)

display.brewer.all()

#p + scale_fill_brewer(palette = "Spectral")
#you can also use the names of colours
#scale_fill_manual(values=c("purple", "cornflowerblue", "grey", "yellow", "orange", "brown"))

palette1 <- brewer.pal(12, "Paired")
palette2 <- brewer.pal(12, "Set3")

length(unique(c(palette, palette2)))

custom <- c(palette1, palette2)

p + scale_fill_manual(values = custom)

p + scale_fill_viridis(discrete = TRUE)
p + scale_fill_viridis(discrete = TRUE, option = "plasma")

```


The `viridis` package also has some nice color palettes (https://cran.r-project.org/web/packages/viridis/vignettes/intro-to-viridis.html). I think they might all be diverging palettes (qualitative is best for our Taxa), but I will showcase a couple here.

```{r}
library(viridis)

p + scale_fill_viridis(discrete = TRUE)
p + scale_fill_viridis(discrete = TRUE, option = "plasma")
```


`RSkittleBrewer` is another option for funky colour palettes.
`ggsci` has a variety of color palettes inspired by different scientific journals as well as television shows (https://cran.r-project.org/web/packages/ggsci/vignettes/ggsci.html).


##Customization of Theme

As mentioned earlier, it is possible to customize every single aspect of a ggplot. Most of this occurs with a call to __theme__, which you can think of as modifying everything BUT your data. For example, my axis labels can be modified, but they (hopefully) have something to do with my data. However, changing the size of the text or the font of the labels is unrelated to my data, and that same structure could be carried over to other plots if I saved my own theme.

Things that you can change with theme include the axis, legend, panels, gridlines, or background.

Each element of theme inherits from one of: 

- element_text (text elements like font, colour, size, face (bold, italics), alignment), 
- element_line (grid lines, axis lines), 
- element_rect (panels and backgrounds - colour, size, fill), 
- element_blank (assigns nothing, usually when you are trying to get rid of something), 
- element_grob (making a grid grob).

ggplot comes with some themes - so I would suggest starting with the one that is closest to what you want, and start modifying from there.

Check out these themes:
- theme_minimal()
- theme_classic()
- theme_bw()
- theme_void()
- theme_dark()
- theme_gray()
- theme_light()

You can look at the default for each theme simply by typing it into the console.

```{r}
theme_bw
```

And this is what `theme_bw()` looks like:

```{r warning = FALSE}

p + theme_bw()


```

Here is an example of theme_dark.

```{r warning = FALSE}

p + theme_dark()
```
There are also themes that have been created and put into a package, ggthemes. Some of these themes are based off of graphs seen in print or on websites (the economist, wall street journal, fivethirtyeight) or to match standard tools (excel, google docs). Information about the themes can be found at https://github.com/jrnold/ggthemes.


```{r warning = FALSE}
library(ggthemes)
p + theme_economist()
```
```{r warning = FALSE}
p + theme_stata()
```

You can also make your own custom theme as demoed here: http://joeystanley.com/blog/custom-themes-in-ggplot2


I am going to show you how to modify a plot, starting from theme_minimal() because I don't like the grey backgrounds or harsh axis lines.


```{r warning = FALSE}
p + theme_minimal()

```


Things I don't like about this plot and their solutions:

|Problem                                                   | Solution                                                                   |
|_________________________________________________________:|:__________________________________________________________________________:|
|x-axis labels overlap                                     | rotate lables: axis.text.x = element_text(angle =90, hjust=1)              |
|country labels are smaller than axis labels               | change size and face: strip.text.x = element_text(face = "bold", size = 16)|
|title is uncentered                                       | adjust horizontally: plot.title = element_text(hjust=0.5, size = 18)       |
|border to separate countries                              | create a border: panel.border = element_rect(fill = NA)                    |
|make y axis ticks                                         | create y axis ticks: axis.ticks.y = element_line()                         |


It isn't necessary to remember all of this syntax, I frequently find myself back at the help page at http://ggplot2.tidyverse.org/reference/theme.html.

The last call to theme will override previous calls. Therefore, if we want to start with `theme_minimal()` as our base, it has to be in our code BEFORE the other modifications. 

```{r warning = FALSE}
ggplot(dat[dat$Taxa %in% keep$Taxa,] , aes(x = Taxa, y = OTUs, fill = Taxa)) + 
  geom_boxplot(outlier.colour = "red") + 
  theme_minimal() +
  theme(axis.text.x = element_text(angle =90, hjust=1), 
        panel.border = element_rect(fill = NA), 
        strip.text.x = element_text(face = "bold", size = 16), 
        plot.title = element_text(hjust=0.5, size = 18), 
        axis.ticks.y = element_line()) + 
  scale_y_log10() + 
  facet_grid(~Country, labeller = labeller(Country = labels)) +
  ggtitle("Abundance of Taxa by Country") +
  xlab(NULL) +
  ylab("log(OTUs)") + 
  guides(fill=FALSE)

```

***
__Challenge__ 


<div style="float:left;margin:0 10px 10px 0" markdown="1">
![](img/Pug-vs-Food-1.jpg){width=150px}

</div>

Change the colour of the plot background. Add minor gridlines, and make the major gridlines black. Increase the spacing in between the panels.


</br>
</br>
</br>

***

###Correlation Plots and Heatmaps

One common way to look for outliers or trends in data is to calculate the correlation between samples and make a correlation plot. Since samples are expected to be in columns, this is an example of where you would use 'wide' data. The default correlation coefficient is the Pearson correlation, but others (Kendall, Spearman) can be specified. Note that the entire matrix must be numeric. 

A heatmap can be made in ggplot by using the tile geom. The fill is now the correlation coefficient, which is a continuous variable (hence the gradient). I have changed the title of the legend from 'value' to 'Pearson Correlation' by using `scale_fill_continuous()`. The '\n' included in the name is a line break. A fantastic resource for modifying legends is http://www.cookbook-r.com/Graphs/Legends_(ggplot2)/.

```{r warning = FALSE}
cdat <- read.csv("data/ENV_pitlatrine.csv", row.names = 1)

cdat <- cor(cdat)

library(reshape2)
melt <-melt(cdat)

ggplot(melt, aes(Var1, Var2, fill = value)) + 
  geom_tile() +
  xlab(NULL) +
  ylab(NULL) +
  scale_fill_continuous(name = "Pearson\nCorrelation")

```

```{r warning = FALSE}
cdat <- read.csv("data/SPE_pitlatrine.csv", row.names = 1)

cdat <- cor(cdat)

library(reshape2)
melt <-melt(cdat)

ggplot(melt, aes(Var1, Var2, fill = value)) + 
  geom_tile() +
  xlab(NULL) +
  ylab(NULL) +
  scale_fill_continuous(name = "Pearson\nCorrelation") +
  theme(axis.text.x = element_text(angle =90, hjust=1)) 
  
```


If this looks a little jarring to you compared to 'normal' heatmaps, it is probably because the data has not been ordered yet. This can be done by calculating the distance between samples (ie. euclidean) and them clustering them in a hierarchical way (ie. complete linkage).

To have a different colour scheme that for example, goes from red to yellow, you can use `scale_fill_gradient()` and specify what you want the diverging colours to be. Note that your legend title now derives from this gradient. 


```{r}
dd <- dist(cdat)
hc <- hclust(dd)
cdat <- cdat[hc$order, hc$order]

melt <-melt(cdat)

ggplot(melt, aes(Var1, Var2, fill = value)) + geom_tile() +
  scale_fill_gradient(name = "Pearson\nCorrelation", low = "red", high = "yellow") +
  xlab(NULL) +
  ylab(NULL)+
  theme(axis.text.x = element_text(angle =90, hjust=1, size =5), legend.title = element_text(size=10), axis.text.y = element_text(size=5)) 

```

We can now see that there is a patch of dark red samples with low correlation. It is difficult to see which sites these are as the names are a bit small. 

Enter the interactive heatmap! `Plotly` allows for interactive plots in a variety of languages, and allows the creation of sharable links (if you are interested you have to sign up for a free account to get API credentials https://plot.ly/r/getting-started/). They have created a package, which, luckily for us, interfaces seamlessly with `ggplot2`. I only need to save my plot and call the function `ggplotly()` to have an interactive graph! 

The response of plotly to manipulations other than hovering can be a bit slow if not using the API and may freeze/crash R, so I recommend saving your code before trying it.  


```{r warning = FALSE}
library(plotly)
p <- ggplot(melt, aes(Var1, Var2, fill = value)) + geom_tile() +
  scale_fill_gradient(name = "Pearson\nCorrelation", low = "red", high = "yellow") +
  xlab(NULL) +
  ylab(NULL)+
  theme(axis.text.x = element_text(angle =90, hjust=1), legend.title = element_text(size=10)) 

ggplotly(p)

```

Hovering will highlight the variables and values relating to that specific data point, and areas can be zoomed in on. We can now see the names of the low correlation samples. While they are all from Tanzania, they do not appear to all be from the same Latrine_Number or Depth. 

`ggvis` is a Bioconductor package allowing interactive graphics in R (http://ggvis.rstudio.com/). It is founded on the grammar of graphics and the syntax is similar, but not identical to `ggplot2`. It is good for interactive scatterplots, bar graphs, line graphs and histograms (although not heatmaps at this point in time). The nature of the interactiveness is highly customizable: you can hover, add slider bars, checkboxes or radiobuttons (it uses Shiny's reactive programming).

In this example I have just taken a scatterplot that we made earlier, and made it possible to hover over a point and see its underlying data. Don't worry about the function and syntax too much - those will be introduced in later lessons. I just wanted to show you another option for interactive visualization that hopefully isn't too intimidating. 

```{r warning = FALSE}
library(ggvis)

p <- ggvis(ndat, x= ~TS, y= ~CODt) 
  
layer_points(p) %>% add_tooltip(function(x) c(paste0("TS=", x$TS, ", CODT=", x$CODt)))

```
###Saving in ggplot2

ggplot2 has its own function for saving ggsave(). 

You can send the plot object to the graphics device and then save that image. 

With ggsave you can minimally input the filename you would like to have, and the path to your file (you can instead combine the path and filename in the filename).

```{r}
ggplot(melt, aes(Var1, Var2, fill = value)) + geom_tile() +
  scale_fill_gradient(name = "Pearson\nCorrelation", low = "red", high = "yellow") +
  xlab(NULL) +
  ylab(NULL)+
  theme(axis.text.x = element_text(angle =90, hjust=1, size =5), legend.title = element_text(size=10), axis.text.y = element_text(size=5)) 

ggsave("heatmap.png", path = "img")
```
However, in some cases you want to tailor your output. You can specify the width, height and units of your image, or you can apply a scaling factor. You can also specify the plot object you want to save instead of whatever was on your graphics device last using the 'plot' parameter. ggsave() can save in a variety of file extensions including pdf, jpeg, tiff, bmp, svg or png.

```{r}
hmap <- ggplot(melt, aes(Var1, Var2, fill = value)) + geom_tile() +
  scale_fill_gradient(name = "Pearson\nCorrelation", low = "red", high = "yellow") +
  xlab(NULL) +
  ylab(NULL)+
  theme(axis.text.x = element_text(angle =90, hjust=1, size =5), legend.title = element_text(size=10), axis.text.y = element_text(size=5)) 

ggsave("heatmap.pdf", plot = hmap, path = "img", scale = 2, width = 150, height = 110, units = "mm")
```

  
##Taking it up a notch


###Multiple plots on one page (ie. for publication images)

There are a variety of methods to mix multiple graphs on the same page, however `ggplot` does not work well with all of them. I am going to work with a package based that uses `gridExtra`(which allows us to arrange plots) but works well with `ggplot` called `ggpubr` (which allows us to align the axes of our plots). For a demonstration, we are going to take 3 plots that we made earlier (a boxplot, a histogram, and a bubble plot), save them as objects, and then arrange and align them in the same figure. (http://www.sthda.com/english/rpkgs/ggpubr/)

There is a function `ggarrange()` which takes your plots, their labels, and how they are arranged in rows and columns. To start, if we just have our boxplot and bubble plot, and we wanted them to be side by side, if you picture each plot as a square in a grid, we would have two columns `ncol = 2` and 1 row 

```{r warning = FALSE}
library(ggpubr)

hist <- ggplot(dat[dat$OTU >=2,], aes(x=OTUs, fill=Country, alpha=0.3)) + 
	geom_histogram(binwidth = 50, position = "dodge") +
  xlim(0,1000) +
  ylim(0,150) +
  geom_rug() 

bub <- ggplot(bdat, aes(x = Country, y = OTUs_per_Taxa, fill=Taxa)) +
  scale_y_log10() +
  geom_jitter(aes(size = sqrt(OTUs_per_Taxa/pi)), pch = 21, show.legend = FALSE) + 
  scale_size_continuous(range=c(1,30)) +
  geom_text(aes(label = Taxa), check_overlap = TRUE, show.legend = FALSE)

ggarrange(bub, hist, 
          labels = c("A", "B"),
          ncol = 2, nrow = 1)


```
While our grid areas are the same size, our plot backgrounds are not. Let's adjust the legend so that it is in the top right corner of the plot, and get remove the white background. We do not need the alpha legend. Do you remember how to get rid of it?


```{r warning = FALSE}
hist <- ggplot(dat[dat$OTU >=2,], aes(x=OTUs, fill=Country, alpha=0.3)) + 
	geom_histogram(binwidth = 50, position = "dodge") +
  xlim(0,1000) +
  ylim(0,150) +
  geom_rug() +
  guides(alpha = FALSE) +
  theme(legend.justification=c(1,1), legend.position=c(1,1), legend.background = element_blank())

ggarrange(bub, hist, 
          labels = c("A", "B"),
          ncol = 2, nrow = 1)
```
Imagine a square with 4 boxes. We are going to put our boxplot in the left top and bottom boxes, the histogram in the top right box, and the bubble plot in the bottom right box.  To do this, we are arranging 2 columns (one with the boxplot and one with the histogram + boxplot) and we are arranging 2 rows (one with the histogram and one with the boxplot).

```{r warning = FALSE}
box <- ggplot(dat[(dat$Taxa=="Clostridia" | dat$Taxa == "Gammaproteobacteria" | dat$Taxa == "Unknown" | dat$Taxa == "Bacilli") & dat$Country == "T",] , aes(x = Taxa, y = OTUs)) + 
  geom_boxplot() + 
  geom_point() +
  theme(axis.text.x = element_text(angle=90, hjust=1)) + 
  scale_y_log10()

ggarrange(box, ggarrange(hist, bub, 
          labels = c("B", "C"),
          nrow = 2),
           ncol = 2, labels = "A")

```
The y-axis lines for the histogram and boxplot are not aligned, but this can be fixed with a call to `align="v"`.


```{r warning = FALSE}
ggarrange(box, ggarrange(hist, bub, 
          labels = c("B", "C"),
          nrow = 2, align = "v"),
           ncol = 2, labels = "A")

```
To make sure all axis titles are the same size (B and C look a little off to me), we can specify the font we want changed and the size we want to change it to. I am also going to make the legend title size the same. I want to make the bubble plot font a little smaller as well. Let's look at the font function. Can I do that? What might be another option?
```{r warning = FALSE}
bub <- ggplot(bdat, aes(x = Country, y = OTUs_per_Taxa, fill=Taxa)) +
  scale_y_log10() +
  geom_jitter(aes(size = sqrt(OTUs_per_Taxa/pi)), pch = 21, show.legend = FALSE) + 
  scale_size_continuous(range=c(1,30)) +
  geom_text(aes(label = Taxa, size=16), check_overlap = TRUE, show.legend = FALSE)


ggarrange(box + font("axis.title", size=9), ggarrange(hist + font("axis.title", size=9) + font("legend.title", size=9), bub + font("axis.title", size=9), 
          labels = c("B", "C"),
          nrow = 2, align = "v"),
           ncol = 2, labels = "A")
```

***
__Challenge__ 


<div style="float:left;margin:0 10px 10px 0" markdown="1">
![](img/Pug-vs-Food-1.jpg){width=150px}

</div>

Using the code available here for a line graph we made earlier in the lesson, make a combined figure. Imagine a square with 4 boxes. We are going to put our line graph in the top right and left boxes, the histogram in the bottom left box, and the bubble plot in the bottom right box. Make sure the legend and axis titles are the same size. Change the legend text for the line graph to be smaller than the legend title. 

```
lin <- ggplot(errdat, aes(x=Temp, y=pH, colour = Country)) + 
  geom_line() + 
  geom_errorbar(aes(ymin=pH-sd_pH, ymax = pH+sd_pH), width = .2, alpha = 0.4)
```

</br>
</br>
</br>

***

```{r include = FALSE, warning = FALSE, eval = FALSE}
lin <- ggplot(errdat, aes(x=Temp, y=pH, colour = Country)) + 
  geom_line() + 
  geom_errorbar(aes(ymin=pH-sd_pH, ymax = pH+sd_pH), width = .2, alpha = 0.4)


ggarrange(lin+ font("axis.title", size=9) + font("legend.title", size=9) + font("legend.text", size = 8), ggarrange(hist + font("axis.title", size=9) + font("legend.title", size=9), bub + font("axis.title", size=9), 
          labels = c("B", "C"),
          ncol = 2, align = "h"),
           nrow = 2, labels = "A")
```



There are many fantastic R packages to analyze and visualize _your_ data. As a group, we are likely working on a variety of specialized areas. The plots we have made so far today should be useful for data exploration for many different kinds of data. In the next section we are going to preview some more complex visualization types, but since these take more time to go through and not everyone may be interested in network diagrams, time-series analysis, or geospatial data, we will not be plotting all of these ourselves. If you are interested in a tutorial on one of these visualization types, please indicate that in the comments area of the lesson survey (https://www.surveymonkey.com/r/VNQZ3KS). 

We will, however do an Upset Plot as my personal crusade to have something better than venn diagrams in publications.


###Interactive graphics 

Plotly:
- https://plot.ly/r/

ggvis:
- http://ggvis.rstudio.com/interactivity.html

Heatmaps:
- https://github.com/talgalili/heatmaply

Interactive time-series data:
- https://rstudio.github.io/dygraphs/
###Network diagrams 
visNetwork (based on igraph)
  https://datastorm-open.github.io/visNetwork/edges.html

###Circos Plots
Migest
  https://gjabel.wordpress.com/category/r/migest/

###Upset Plots
- UpSetR https://github.com/hms-dbmi/UpSetR

Upset plots are an alternative to venn diagrams that show the intersection of sets, as well as the size of the sets. Additionally, venn diagrams can be difficult to interpret for greater than 2 or 3 sets.

Let's see how this works practically. The question that we are asking is how many sites are each Taxa represented at and what is the overlap with other Taxa? The data that we have is an OTU table. For this purpose we simply want to know whether a given Taxa was present or absent in the sample, and then we can form intersections based on this information. So, for each OTU value that is not 0, we replace it with a 1 instead. Then all we have to do is to enter our data set, the number of sets we are inputting, if we want to order the results (in this case by frequency), and how many intersections we want to show. I have picked 8 different Taxa, which would make a crazy venn diagram with a lot of zeros in it since a couple of these Taxa are rare. Here, I will show 15 intersections - we know the remaining intersections would be zero since this is ordered by frequency.

```{r}
library(UpSetR)

subdat <- read.csv("data/SPE_pitlatrine.csv", stringsAsFactors = FALSE)

test <- subdat %>% 
   rownames_to_column %>% 
   gather(var, value, -rowname) %>% 
   spread(rowname, value) 


colnames(test) <- test[30,]
test <- test[-30,]

test[,-1] <- apply(test[,-1], 2, function(x) ifelse(as.numeric(x)!=0, 1, 0))
test <- test[, c(1,4,9,17,22,19,20,33,35)]


upset(test, nsets = 8, empty.intersections = TRUE, order.by = "freq", nintersects = 15, main.bar.color = "black", sets.bar.color = "#56B4E9" )

```
Let's look at our greatest Intersection Size (equal to 28). This means that 28 of our 81 samples have Clostridia, Alphaproteobacteria, and Deinococci present in the same sample WITHOUT Chloroflexi, both Acidobacteria, Cyanobacteria or Lentisphaeria also being present. We can see from the Set Size that Clostridia and Alphaproteobacteria are present in almost all samples, and Deinococci is present in greater than 3/4 of the samples. You might expect this overlap to be higher, but remember that this is without the other Taxa and moving along our interesection sizes and dot matrix, we can see that other intersections include these bacteria. Some quick mental math shows that 63 samples have these 3 bacteria present.  

###Geospatial Data
Static Maps:
- https://bhaskarvk.github.io/user2017.geodataviz/notebooks/02-Static-Maps.nb.html

Interactive Maps:
- https://bhaskarvk.github.io/user2017.geodataviz/notebooks/03-Interactive-Maps.nb.html
  
###Phylogenetics Data 
ggtree: 
- http://www.bioconductor.org/packages/3.7/bioc/vignettes/ggtree/inst/doc/treeAnnotation.html#annotate-clades

treeman: 
- https://bmcresnotes.biomedcentral.com/articles/10.1186/s13104-016-2340-8

metacoder: 
- https://github.com/grunwaldlab/metacoder

phyloseq:
- https://joey711.github.io/phyloseq/index.html

###Genomics Data 
ggbio: 
- http://www.bioconductor.org/packages/2.11/bioc/vignettes/ggbio/inst/doc/ggbio.pdf

GenVisR:
- https://bioconductor.org/packages/release/bioc/vignettes/GenVisR/inst/doc/GenVisR_intro.html

GenomeGraphs:
- https://bmcbioinformatics.biomedcentral.com/articles/10.1186/1471-2105-10-2
   
***
__Challenge__ 


<div style="float:left;margin:0 10px 10px 0" markdown="1">
![](img/Pug-vs-Food-1.jpg){width=150px}

</div>

Install and load the gapminder package. Save the gapminder data into an object using `dat <- gapminder`. Plot the following:

  1. a qualitative variable with a quantitative variable (at least 2 ways). 
    + Facet your plot by the qualitative variable.
  2. two quantitative variables (at least 2 ways). 
    + Add a regression line to your plot.

</br>
</br>
</br>

***


__Resources:__     

Wickham, Hadley. (2010). A Layered Grammar of Graphics. Journal of Computational and Statistical Graphics.     
Wilkinson, L. (2005), The Grammar of Graphics (2nd ed.). Statistics and Computing, New York: Springer. [14, 18]          
http://www.cookbook-r.com/Graphs/     
https://github.com/jennybc/ggplot2-tutorial     
Tufte, Edward R. The Visual Display of Quantitative Information.      
http://stcorp.nl/R_course/tutorial_ggplot2.html     
http://ggplot2.tidyverse.org/reference/theme.html
http://joeystanley.com/blog/custom-themes-in-ggplot2
https://github.com/jrnold/ggthemes
https://www.nceas.ucsb.edu/~frazier/RSpatialGuides/colorPaletteCheatsheet.pdf
https://cran.r-project.org/web/packages/ggrepel/vignettes/ggrepel.html
http://www.cookbook-r.com/Graphs/Legends_(ggplot2)/
https://github.com/eclarke/ggbeeswarm
https://cran.r-project.org/web/packages/ggsci/vignettes/ggsci.html
http://www.sthda.com/english/rpkgs/ggpubr/
https://rpubs.com/drsong/9575
http://elpub.bib.uni-wuppertal.de/edocs/dokumente/fbb/wirtschaftswissenschaft/sdp/sdp15/sdp15006.pdf
http://www.sthda.com/english/articles/24-ggpubr-publication-ready-plots/81-ggplot2-easy-way-to-mix-multiple-graphs-on-the-same-page/


#Post-Lesson Assessment
***

Your feedback is essential to help the next cohort of trainees. Please take a minute to complete the following short survey:
https://www.surveymonkey.com/r/VNQZ3KS

</br>

***

</br>

Thanks for coming!!!

![](img/rstudio-bomb.png){width=300px}